{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HTML Parsing: Nanfang College News & liepin Practice\n",
    "\n",
    "* This week's main topics: parsing HTML and XPath practice\n",
    "* 21_Web Data Mining_week05\n",
    "* Handout designer: 许智超\n",
    "<br/>\n",
    "<br/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Review of Last Week and Thoughts on Pagination\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "from requests_html import HTMLSession\n",
    "import requests_html\n",
    "import pandas as pd\n",
    "import urllib.parse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A1: request the first listing page on nfu.edu.cn\n",
    "session = HTMLSession()\n",
    "r = session.get(\"https://www.nfu.edu.cn/mtnf/index.htm\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving and loading HTML page data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the downloaded page to disk\n",
    "with open(\"html_out/_nfu_文学与传媒学院.html\", encoding=\"utf8\", mode=\"w\") as fp:\n",
    "    fp.write(r.html.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the saved page back as a string\n",
    "with open(\"html_out/_nfu_文学与传媒学院.html\", encoding=\"utf8\", mode=\"r\") as fp:\n",
    "    html_load = fp.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parsing with soup_parse: HTML string => HTML element tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Element html at 0x11bef31d8>"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Parse the HTML string into an element tree that supports XPath queries\n",
    "parsed = requests_html.soup_parse(html_load)\n",
    "parsed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parsing and rebuilding links (article links)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ParseResult(scheme='https', netloc='www.nfu.edu.cn', path='/mtnf/index.htm', params='', query='', fragment='')"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Split the listing-page URL into its components\n",
    "base_url = r.url\n",
    "nfu_urlparse = urllib.parse.urlparse(base_url)\n",
    "nfu_urlparse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['https://www.nfu.edu.cn/mtnf/2345271ded6a42eea0333b4b3fcff916.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/5c42612d30f34c51ad0ac02f105fbb96.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/e7bb1b1f321848c293d766b56499f490.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/8e52779ca01e442c91c3464fbebfade3.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/7f58d83fe918438bbf247a3c5fac7b0c.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/cdcd599b51534ba28fad4f8339e50912.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/339fce86633446c79e6a8b641f347bfc.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/e47e419e3c074425840ecd0a3c59cd71.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/f3ea8ffe88944b049f8938c4824307ea.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/d7c765b927164fdd8b8ee552b76c16a1.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/800d761cd0ed4dc3848f354ccebcd6c3.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/944218629f714229afbbc6193daaa717.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/6ad6fe1276a04e2e81c00a27550cc5de.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/9bf3244bff9843caba2c5adc33f3b831.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/ef93603c2a4d40598036e67c807fb369.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/95af98c22740480c8da90f85caf9f662.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/695a7e02a35344cd96df3d1231eca03e.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/d20faf2a3281484eba35757b5c4fdc5d.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/c24ab52b72b44ea4b7c4d9b7fa107530.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/470662a2289d4546955d54fc981c9cb1.htm']"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Rebuild absolute article links from the relative hrefs on the listing page\n",
    "section = nfu_urlparse.path.split('/')[1]  # 'mtnf'\n",
    "list_URL = [\n",
    "    urllib.parse.urlunparse(\n",
    "        [nfu_urlparse.scheme, nfu_urlparse.netloc, '/' + section + '/' + detail_url, '', '', '']\n",
    "    )\n",
    "    for detail_url in parsed.xpath('//div[@class=\"news_title\"]/a/@href')\n",
    "]\n",
    "list_URL"
   ]
  },
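  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A possible shortcut, not part of the original handout: `urllib.parse.urljoin` from the standard library resolves a relative `href` against the URL of the page it was found on, which should give the same absolute link as the manual `urlunparse` call above. The sketch below hard-codes one `href` copied from the output above so that it runs without network access; `listing_url`, `example_href` and `joined` are illustrative names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import urllib.parse\n",
    "\n",
    "# Listing-page URL and one relative href copied from the output above\n",
    "listing_url = 'https://www.nfu.edu.cn/mtnf/index.htm'\n",
    "example_href = '2345271ded6a42eea0333b4b3fcff916.htm'\n",
    "\n",
    "# urljoin replaces the last path segment (index.htm) with the relative href\n",
    "joined = urllib.parse.urljoin(listing_url, example_href)\n",
    "joined"
   ]
  },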
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>标题</th>\n",
       "      <th>链结</th>\n",
       "      <th>日期</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>【羊城晚报】向国内一流应用型大学进军！广州南方学院揭牌</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/2345271ded6a42eea0...</td>\n",
       "      <td>2021-03-16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>【广州日报】中大南方学院转设为广州南方学院，今日正式挂牌！</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/5c42612d30f34c51ad...</td>\n",
       "      <td>2021-03-16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>【新快报】建设中国一流民办大学 广州南方学院转设更名挂牌</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/e7bb1b1f321848c293...</td>\n",
       "      <td>2021-03-16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>【新华网】喻世友：办一流民办高校 让大学回到大学</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/8e52779ca01e442c91...</td>\n",
       "      <td>2021-03-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>【南方+】岭南春来早！央视走进广州从化直播都市乡村的崭新气象</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/7f58d83fe918438bbf...</td>\n",
       "      <td>2021-03-02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>【羊城晚报】2020年广东高等教育遇变更强：云端课堂新活力，独立学院转设更名</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/cdcd599b51534ba28f...</td>\n",
       "      <td>2021-01-04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>【央广网】中山大学南方学院荣获“广东民办教育四十周年突出贡献机构”称号</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/339fce86633446c79e...</td>\n",
       "      <td>2020-12-22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>【南方+】广州从化粤创之夜，中大南方学院学子激情四射！</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/e47e419e3c07442584...</td>\n",
       "      <td>2020-12-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>【南方+】嘉木成林|中山大学南方学院“共青春、沐韶华”2020文艺汇演圆满举行</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/f3ea8ffe88944b049f...</td>\n",
       "      <td>2020-11-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>【南方+】增强学习内动力，关注学校新发展 ——校长午餐会谈心来了</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/d7c765b927164fdd8b...</td>\n",
       "      <td>2020-11-17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>【央广网】增强学习内动力，关注学校新发展 ——中山大学南方学院校长午餐会谈心</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/800d761cd0ed4dc384...</td>\n",
       "      <td>2020-11-17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>【人民政协网】南粤民办高校亮相高交会，彰显科技创新魅力</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/944218629f714229af...</td>\n",
       "      <td>2020-11-17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>【凤凰新闻网】南粤民办高校亮相高交会，彰显科技创新魅力</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/6ad6fe1276a04e2e81...</td>\n",
       "      <td>2020-11-17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>【羊城晚报】中国会计学会学术年会召开，总参会人数创下历史纪录</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/9bf3244bff9843caba...</td>\n",
       "      <td>2020-11-04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>【中国教育在线】中国会计学会2020学术年会在中山大学南方学院召开</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/ef93603c2a4d405980...</td>\n",
       "      <td>2020-11-03</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>【央广网】中国会计学会2020学术年会在中山大学南方学院召开</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/95af98c22740480c8d...</td>\n",
       "      <td>2020-11-03</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>【南方+】中山大学南方学院组织全体党员干部收看深圳经济特区建立40周年庆祝大会</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/695a7e02a35344cd96...</td>\n",
       "      <td>2020-10-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>【央广网】零突破！中山大学南方学院首获广东省普通高校重点科研平台立项</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/d20faf2a3281484eba...</td>\n",
       "      <td>2020-10-27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>【央广网】中山大学南方学院举行2020年师德教育报告会暨新聘教师入职宣誓仪式</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/c24ab52b72b44ea4b7...</td>\n",
       "      <td>2020-10-16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>【中国教育在线】聆听校长第一课 | 中大南方组织学习《习近平谈治国理政》第三卷—论中国经济发展</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/470662a2289d454695...</td>\n",
       "      <td>2020-10-15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 标题  \\\n",
       "0                       【羊城晚报】向国内一流应用型大学进军！广州南方学院揭牌   \n",
       "1                     【广州日报】中大南方学院转设为广州南方学院，今日正式挂牌！   \n",
       "2                      【新快报】建设中国一流民办大学 广州南方学院转设更名挂牌   \n",
       "3                          【新华网】喻世友：办一流民办高校 让大学回到大学   \n",
       "4                    【南方+】岭南春来早！央视走进广州从化直播都市乡村的崭新气象   \n",
       "5            【羊城晚报】2020年广东高等教育遇变更强：云端课堂新活力，独立学院转设更名   \n",
       "6               【央广网】中山大学南方学院荣获“广东民办教育四十周年突出贡献机构”称号   \n",
       "7                       【南方+】广州从化粤创之夜，中大南方学院学子激情四射！   \n",
       "8           【南方+】嘉木成林|中山大学南方学院“共青春、沐韶华”2020文艺汇演圆满举行   \n",
       "9                  【南方+】增强学习内动力，关注学校新发展 ——校长午餐会谈心来了   \n",
       "10           【央广网】增强学习内动力，关注学校新发展 ——中山大学南方学院校长午餐会谈心   \n",
       "11                      【人民政协网】南粤民办高校亮相高交会，彰显科技创新魅力   \n",
       "12                      【凤凰新闻网】南粤民办高校亮相高交会，彰显科技创新魅力   \n",
       "13                   【羊城晚报】中国会计学会学术年会召开，总参会人数创下历史纪录   \n",
       "14                【中国教育在线】中国会计学会2020学术年会在中山大学南方学院召开   \n",
       "15                   【央广网】中国会计学会2020学术年会在中山大学南方学院召开   \n",
       "16          【南方+】中山大学南方学院组织全体党员干部收看深圳经济特区建立40周年庆祝大会   \n",
       "17               【央广网】零突破！中山大学南方学院首获广东省普通高校重点科研平台立项   \n",
       "18           【央广网】中山大学南方学院举行2020年师德教育报告会暨新聘教师入职宣誓仪式   \n",
       "19  【中国教育在线】聆听校长第一课 | 中大南方组织学习《习近平谈治国理政》第三卷—论中国经济发展   \n",
       "\n",
       "                                                   链结          日期  \n",
       "0   https://www.nfu.edu.cn/mtnf/2345271ded6a42eea0...  2021-03-16  \n",
       "1   https://www.nfu.edu.cn/mtnf/5c42612d30f34c51ad...  2021-03-16  \n",
       "2   https://www.nfu.edu.cn/mtnf/e7bb1b1f321848c293...  2021-03-16  \n",
       "3   https://www.nfu.edu.cn/mtnf/8e52779ca01e442c91...  2021-03-12  \n",
       "4   https://www.nfu.edu.cn/mtnf/7f58d83fe918438bbf...  2021-03-02  \n",
       "5   https://www.nfu.edu.cn/mtnf/cdcd599b51534ba28f...  2021-01-04  \n",
       "6   https://www.nfu.edu.cn/mtnf/339fce86633446c79e...  2020-12-22  \n",
       "7   https://www.nfu.edu.cn/mtnf/e47e419e3c07442584...  2020-12-08  \n",
       "8   https://www.nfu.edu.cn/mtnf/f3ea8ffe88944b049f...  2020-11-26  \n",
       "9   https://www.nfu.edu.cn/mtnf/d7c765b927164fdd8b...  2020-11-17  \n",
       "10  https://www.nfu.edu.cn/mtnf/800d761cd0ed4dc384...  2020-11-17  \n",
       "11  https://www.nfu.edu.cn/mtnf/944218629f714229af...  2020-11-17  \n",
       "12  https://www.nfu.edu.cn/mtnf/6ad6fe1276a04e2e81...  2020-11-17  \n",
       "13  https://www.nfu.edu.cn/mtnf/9bf3244bff9843caba...  2020-11-04  \n",
       "14  https://www.nfu.edu.cn/mtnf/ef93603c2a4d405980...  2020-11-03  \n",
       "15  https://www.nfu.edu.cn/mtnf/95af98c22740480c8d...  2020-11-03  \n",
       "16  https://www.nfu.edu.cn/mtnf/695a7e02a35344cd96...  2020-10-27  \n",
       "17  https://www.nfu.edu.cn/mtnf/d20faf2a3281484eba...  2020-10-27  \n",
       "18  https://www.nfu.edu.cn/mtnf/c24ab52b72b44ea4b7...  2020-10-16  \n",
       "19  https://www.nfu.edu.cn/mtnf/470662a2289d454695...  2020-10-15  "
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Output the results\n",
    "# B-D-1: build a pd.DataFrame (covered in the pandas course)\n",
    "df = pd.DataFrame( {\n",
    "         \"标题\": parsed.xpath('//div[@class=\"news_title\"]/a/@title'),\n",
    "         \"链结\": list_URL,\n",
    "         \"日期\": parsed.xpath('//font[@class=\"right-more\"]/text()'),\n",
    "     } )\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B-D-2: export the pd.DataFrame to Excel (covered in the pandas course)\n",
    "df.to_excel(\"data_out/nfu_文学与传媒学院.xlsx\", sheet_name=\"检索结果\")"
   ]
  },
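  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# This Week, Part 1: How Do We Handle Pagination?\n",
    "\n",
    "* 1. How do the page links differ?\n",
    "* 2. How many pages are there?\n",
    "* 3. Build the URL queue for pagination\n",
    "* 4. Save the HTML files in bulk\n",
    "* 5. Save to Excel in bulk"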
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How do the page links differ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://www.nfu.edu.cn/mtnf/index.htm'"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Page 1\n",
    "base_url_01 = r.url\n",
    "base_url_01"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SplitResult(scheme='https', netloc='www.nfu.edu.cn', path='/mtnf/index.htm', query='', fragment='')"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "urllib.parse.urlsplit(base_url_01)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>第一页</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>https</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>www.nfu.edu.cn</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>/mtnf/index.htm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               第一页\n",
       "0            https\n",
       "1   www.nfu.edu.cn\n",
       "2  /mtnf/index.htm\n",
       "3                 \n",
       "4                 "
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(urllib.parse.urlsplit(base_url_01)).rename({0:\"第一页\"},axis=1)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://www.nfu.edu.cn/mtnf/index2.htm'"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Page 2\n",
    "base_url_02 = session.get('https://www.nfu.edu.cn/mtnf/index2.htm').url\n",
    "base_url_02"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>第一页</th>\n",
       "      <th>第二页</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>https</td>\n",
       "      <td>https</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>www.nfu.edu.cn</td>\n",
       "      <td>www.nfu.edu.cn</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>/mtnf/index.htm</td>\n",
       "      <td>/mtnf/index2.htm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               第一页               第二页\n",
       "0            https             https\n",
       "1   www.nfu.edu.cn    www.nfu.edu.cn\n",
       "2  /mtnf/index.htm  /mtnf/index2.htm\n",
       "3                                   \n",
       "4                                   "
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['第二页'] = urllib.parse.urlsplit(base_url_02)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How many pages are there?\n",
    "* Page 3 ... page n? How many pages in total?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "20\n"
     ]
    }
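   ],
   "source": [
    "# Request index1.htm, index2.htm, ... and stop at the first page that does not return 200\n",
    "for i in range(1,100):\n",
    "    r = session.get('https://www.nfu.edu.cn/mtnf/index'+str(i)+'.htm')\n",
    "    if r.status_code != 200:\n",
    "        print(i)\n",
    "        break\n",
    "# The first failing request is index20.htm, so the numbered pages run from index1.htm to index19.htm"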
   ]
  },
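  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The probing loop above can be wrapped in a small helper so that the result is returned rather than only printed. This is a sketch, not the handout's own code: `first_missing_page` and `fake_status` are hypothetical names, and `fake_status` is an offline stand-in that assumes pages 1 to 19 exist, as observed in the output above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def first_missing_page(status_of, start=1, stop=100):\n",
    "    \"\"\"Return the first page number whose status code is not 200, or None.\"\"\"\n",
    "    for i in range(start, stop):\n",
    "        if status_of(i) != 200:\n",
    "            return i\n",
    "    return None\n",
    "\n",
    "\n",
    "def fake_status(i):\n",
    "    # Offline stand-in for a real request (assumption: pages 1-19 exist)\n",
    "    return 200 if i <= 19 else 404\n",
    "\n",
    "\n",
    "first_missing = first_missing_page(fake_status)\n",
    "last_page = first_missing - 1\n",
    "print(first_missing, last_page)\n",
    "\n",
    "# Against the live site, status_of could be:\n",
    "# lambda i: session.get('https://www.nfu.edu.cn/mtnf/index' + str(i) + '.htm').status_code"
   ]
  },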
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building the URL queue for pagination"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['https://www.nfu.edu.cn/mtnf/index1.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index2.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index3.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index4.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index5.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index6.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index7.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index8.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index9.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index10.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index11.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index12.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index13.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index14.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index15.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index16.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index17.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index18.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index19.htm']"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url_group = ['https://www.nfu.edu.cn/mtnf/index'+str(i)+'.htm' for i in range(1,20)]\n",
    "url_group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "url_group.insert(0,'https://www.nfu.edu.cn/mtnf/index.htm')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['https://www.nfu.edu.cn/mtnf/index.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index1.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index2.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index3.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index4.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index5.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index6.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index7.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index8.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index9.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index10.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index11.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index12.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index13.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index14.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index15.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index16.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index17.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index18.htm',\n",
       " 'https://www.nfu.edu.cn/mtnf/index19.htm']"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url_group"
   ]
  },
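  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two steps above (a list comprehension followed by `insert`) can also be combined into one function, which avoids inserting `index.htm` twice if a cell is re-run. A minimal sketch; `build_url_group` and `url_group_sketch` are hypothetical names, and the last page number 19 is taken from the probing result above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_url_group(last_page, base='https://www.nfu.edu.cn/mtnf/'):\n",
    "    \"\"\"Return index.htm followed by index1.htm ... index<last_page>.htm.\"\"\"\n",
    "    first = base + 'index.htm'\n",
    "    numbered = [base + 'index' + str(i) + '.htm' for i in range(1, last_page + 1)]\n",
    "    return [first] + numbered\n",
    "\n",
    "\n",
    "url_group_sketch = build_url_group(19)\n",
    "len(url_group_sketch)"
   ]
  },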
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/mtnf/index.htm'"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "urllib.parse.urlparse(url_group[0]).path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving HTML files in bulk"
   ]
  },
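  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "for url in url_group:\n",
    "    r = session.get(url)\n",
    "    # Mirror the URL path under html_out/, e.g. html_out/mtnf/index.htm\n",
    "    path = urllib.parse.urlparse(url).path.lstrip('/')\n",
    "    out_path = os.path.join('html_out', path)\n",
    "    os.makedirs(os.path.dirname(out_path), exist_ok=True)\n",
    "    with open(out_path, encoding=\"utf8\", mode=\"w\") as fp:\n",
    "        fp.write(r.html.html)"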
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving to Excel in bulk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [],
   "source": [
    "# XPath expressions for the link, title and date of each news item:\n",
    "dict_xpath = {\n",
    "    '链接_xpath':'//div[@class=\"news_title\"]/a/@href',\n",
    "    '标题_xpath':'//div[@class=\"news_title\"]/a/@title',\n",
    "    '日期_xpath':'//font[@class=\"right-more\"]/text()'\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pages_content_url(parsed):\n",
    "    \"\"\"Return absolute article URLs for one parsed listing page (uses the globals nfu_urlparse and dict_xpath).\"\"\"\n",
    "    section = nfu_urlparse.path.split('/')[1]  # 'mtnf'\n",
    "    return [\n",
    "        urllib.parse.urlunparse(\n",
    "            [nfu_urlparse.scheme, nfu_urlparse.netloc, '/' + section + '/' + detail_url, '', '', '']\n",
    "        )\n",
    "        for detail_url in parsed.xpath(dict_xpath['链接_xpath'])\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['index13.htm', 'index12.htm', 'index10.htm', 'index11.htm', 'index.htm', 'index9.htm', 'index15.htm', 'index14.htm', 'index8.htm', 'index16.htm', 'index17.htm', 'index6.htm', 'index7.htm', 'index5.htm', 'index19.htm', 'index18.htm', 'index4.htm', 'index1.htm', 'index3.htm', 'index2.htm']\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>标题</th>\n",
       "      <th>链结</th>\n",
       "      <th>日期</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>81</th>\n",
       "      <td>1</td>\n",
       "      <td>【广州日报】中大南方学院转设为广州南方学院，今日正式挂牌！</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/5c42612d30f34c51ad...</td>\n",
       "      <td>2021-03-16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>80</th>\n",
       "      <td>0</td>\n",
       "      <td>【羊城晚报】向国内一流应用型大学进军！广州南方学院揭牌</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/2345271ded6a42eea0...</td>\n",
       "      <td>2021-03-16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>82</th>\n",
       "      <td>2</td>\n",
       "      <td>【新快报】建设中国一流民办大学 广州南方学院转设更名挂牌</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/e7bb1b1f321848c293...</td>\n",
       "      <td>2021-03-16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>83</th>\n",
       "      <td>3</td>\n",
       "      <td>【新华网】喻世友：办一流民办高校 让大学回到大学</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/8e52779ca01e442c91...</td>\n",
       "      <td>2021-03-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>84</th>\n",
       "      <td>4</td>\n",
       "      <td>【南方+】岭南春来早！央视走进广州从化直播都市乡村的崭新气象</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/7f58d83fe918438bbf...</td>\n",
       "      <td>2021-03-02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>280</th>\n",
       "      <td>0</td>\n",
       "      <td>【信息时报】中大南方学院资助学生环保创业</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/52f42ac444a549d8ab...</td>\n",
       "      <td>2015-03-22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>281</th>\n",
       "      <td>1</td>\n",
       "      <td>【新华网】中山大学南方学院院长讲授新学期思想政治“第一课”</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/f7e9a4ca40d14ba592...</td>\n",
       "      <td>2015-03-19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>282</th>\n",
       "      <td>2</td>\n",
       "      <td>【信息时报】大学生开学讲家风</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/3df40478b3f148bfa0...</td>\n",
       "      <td>2015-03-02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>283</th>\n",
       "      <td>3</td>\n",
       "      <td>【信息时报】中大南方学院“逸仙文库”开放</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/7bb296d40a77479da8...</td>\n",
       "      <td>2014-12-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>284</th>\n",
       "      <td>4</td>\n",
       "      <td>【南方都市报】中大南方学院将试行按照学分缴纳学费</td>\n",
       "      <td>https://www.nfu.edu.cn/mtnf/bcbd4e99d29347bfb9...</td>\n",
       "      <td>2014-10-31</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>385 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     index                              标题  \\\n",
       "81       1   【广州日报】中大南方学院转设为广州南方学院，今日正式挂牌！   \n",
       "80       0     【羊城晚报】向国内一流应用型大学进军！广州南方学院揭牌   \n",
       "82       2    【新快报】建设中国一流民办大学 广州南方学院转设更名挂牌   \n",
       "83       3        【新华网】喻世友：办一流民办高校 让大学回到大学   \n",
       "84       4  【南方+】岭南春来早！央视走进广州从化直播都市乡村的崭新气象   \n",
       "..     ...                             ...   \n",
       "280      0            【信息时报】中大南方学院资助学生环保创业   \n",
       "281      1   【新华网】中山大学南方学院院长讲授新学期思想政治“第一课”   \n",
       "282      2                  【信息时报】大学生开学讲家风   \n",
       "283      3            【信息时报】中大南方学院“逸仙文库”开放   \n",
       "284      4        【南方都市报】中大南方学院将试行按照学分缴纳学费   \n",
       "\n",
       "                                                    链结          日期  \n",
       "81   https://www.nfu.edu.cn/mtnf/5c42612d30f34c51ad...  2021-03-16  \n",
       "80   https://www.nfu.edu.cn/mtnf/2345271ded6a42eea0...  2021-03-16  \n",
       "82   https://www.nfu.edu.cn/mtnf/e7bb1b1f321848c293...  2021-03-16  \n",
       "83   https://www.nfu.edu.cn/mtnf/8e52779ca01e442c91...  2021-03-12  \n",
       "84   https://www.nfu.edu.cn/mtnf/7f58d83fe918438bbf...  2021-03-02  \n",
       "..                                                 ...         ...  \n",
       "280  https://www.nfu.edu.cn/mtnf/52f42ac444a549d8ab...  2015-03-22  \n",
       "281  https://www.nfu.edu.cn/mtnf/f7e9a4ca40d14ba592...  2015-03-19  \n",
       "282  https://www.nfu.edu.cn/mtnf/3df40478b3f148bfa0...  2015-03-02  \n",
       "283  https://www.nfu.edu.cn/mtnf/7bb296d40a77479da8...  2014-12-26  \n",
       "284  https://www.nfu.edu.cn/mtnf/bcbd4e99d29347bfb9...  2014-10-31  \n",
       "\n",
       "[385 rows x 4 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "list_df = []\n",
    "\n",
    "# Collect the listing pages saved in the previous step (os.listdir returns them in arbitrary order)\n",
    "files = os.listdir('html_out/mtnf/')\n",
    "print(files)\n",
    "\n",
    "for html in files:\n",
    "    # Load one saved page and parse it into an element tree\n",
    "    with open('html_out/mtnf/' + html, encoding='utf8', mode='r') as fp:\n",
    "        html_load = fp.read()\n",
    "    parsed = requests_html.soup_parse(html_load)\n",
    "    list_URL = pages_content_url(parsed)\n",
    "\n",
    "    # One DataFrame per page: title, link, date\n",
    "    df = pd.DataFrame({\n",
    "        \"标题\": parsed.xpath(dict_xpath['标题_xpath']),\n",
    "        \"链结\": list_URL,\n",
    "        \"日期\": parsed.xpath(dict_xpath['日期_xpath']),\n",
    "    })\n",
    "    list_df.append(df)\n",
    "\n",
    "# Combine all pages and sort by date, newest first\n",
    "df_all = pd.concat(list_df).reset_index().sort_values(by='日期', ascending=False)\n",
    "display(df_all)\n",
    "\n",
    "# Write the combined table to a single sheet\n",
    "with pd.ExcelWriter('data_out/nfu_官网.xlsx', mode='w', engine=\"openpyxl\") as writer:\n",
    "    df_all.to_excel(writer, sheet_name='媒体报道')"
   ]
  },
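  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The printed file list above is in arbitrary order because `os.listdir` makes no ordering guarantee. If the pages should be processed in page order, one option is to sort the file names by their page number before the loop. A sketch under the assumption that every file is named `index.htm` or `index<N>.htm`; `page_number`, `sample_files` and `sorted_files` are hypothetical names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def page_number(filename):\n",
    "    \"\"\"Return 0 for index.htm and N for index<N>.htm.\"\"\"\n",
    "    if not (filename.startswith('index') and filename.endswith('.htm')):\n",
    "        raise ValueError('unexpected file name: ' + filename)\n",
    "    digits = filename[len('index'):-len('.htm')]\n",
    "    return int(digits) if digits else 0\n",
    "\n",
    "\n",
    "# A few names taken from the printed list above\n",
    "sample_files = ['index13.htm', 'index.htm', 'index2.htm', 'index1.htm']\n",
    "sorted_files = sorted(sample_files, key=page_number)\n",
    "sorted_files"
   ]
  },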
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
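  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Homework: Add Sheets and Scrape the Following Sections\n",
    "\n",
    "* School news (学校要闻)\n",
    "* Campus updates (校园动态)\n",
    "* Notices and announcements (通知公告)\n",
    "* Bidding and tenders (招投标)\n",
    "* Higher education news (高教动态)\n"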
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preview Goals for Next Week\n",
    "## Applying XPath to [m.liepin.com](https://m.liepin.com/zhaopin/)\n",
    "\n",
    "You are a data scientist: what valuable data does m.liepin.com hold, and how would you scrape it?\n",
    "* Job title\n",
    "* Job location\n",
    "* Job\n",
    "* ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "336px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
